Recommendation Systems - Module Project


India is the second-largest smartphone market in the world after China. About 134 million smartphones were sold across India in 2017, and sales are estimated to increase to about 442 million in 2022. India ranked second in Asia Pacific for average time spent on the mobile web by smartphone users. The combination of very high sales volumes and typical smartphone consumer behaviour has made India a very attractive market for foreign vendors. In terms of consumer behaviour, 97% of consumers turn to a search engine when they are buying a product, versus 15% who turn to social media. If a seller succeeds in showing smartphones that match a user’s behaviour/choice at the right place, there is a 90% chance that the user will enquire about them. This case study aims to build a recommendation system based on an individual consumer’s behaviour or choice.

• author: name of the person who gave the rating
• country: country of the person who gave the rating
• date: date of the rating
• domain: website from which the rating was taken
• extract: rating content
• language: language in which the rating was given
• product: name of the product/mobile phone for which the rating was given
• score: average rating for the phone
• score_max: highest rating given for the phone
• source: source from where the rating was taken

We will build a recommendation system using popularity-based and collaborative filtering methods, which recommend the most popular phones and phones personalised to each user, respectively.


1. Import and explore the data.

About 4% of the values in the score (target variable) and score_max columns are missing. We drop these rows from the data frame, since we can't use these data points to build our recommendation systems. We also drop the score_max column, as it has the single value 10 for all non-missing data points and therefore adds no information.
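The source does not include the code for this cleaning step. A minimal sketch with pandas, assuming a data frame `df` with the columns described above (the toy rows here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews table (hypothetical rows).
df = pd.DataFrame({
    "author": ["a", "b", "c", "d"],
    "product": ["phone x", "phone y", "phone x", "phone z"],
    "score": [8.0, np.nan, 10.0, 6.0],
    "score_max": [10.0, np.nan, 10.0, 10.0],
})

# Share of missing values per column, in percent.
missing_pct = df[["score", "score_max"]].isna().mean() * 100

# Rows without a score cannot serve as training targets, so drop them.
df = df.dropna(subset=["score", "score_max"])

# If score_max is constant on the remaining rows, it adds no information.
if df["score_max"].nunique() == 1:
    df = df.drop(columns="score_max")
```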

One value is missing in 'product'. The variants of a phone that differ only in their description are not new phones, and we might not have enough data per variant to recommend individual variants. So we drop the 'product' column and keep only a new column named 'phone_model', which holds the main phone model name.

Engineered Features: year, phone_model, company

Given the rapid pace of technological change, products from before 2006, i.e. before the smartphone era, might not be relevant today and might no longer be available for purchase. So we drop phones whose reviews are older than 2006.
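One way to derive the engineered `year` feature and apply this cut-off is sketched below. The date format (`%d/%m/%Y`) and the choice to keep reviews from 2006 itself are assumptions, since the source shows neither.

```python
import pandas as pd

# Hypothetical rows; the day/month/year format is an assumption.
df = pd.DataFrame({
    "date": ["15/03/2004", "01/06/2010", "20/11/2016"],
    "score": [7.0, 8.0, 9.0],
})

# Engineered feature: review year taken from the date column.
df["year"] = pd.to_datetime(df["date"], format="%d/%m/%Y").dt.year

# Drop reviews older than 2006 (2006 itself is kept here).
df = df[df["year"] >= 2006]
```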

Keep only 1000000 data samples. Use random state=612
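A reproducible sample can be drawn with `DataFrame.sample`. The sketch below uses a small toy frame and a small `n_keep` in place of the 1,000,000 rows; only the random state 612 comes from the text.

```python
import pandas as pd

df = pd.DataFrame({"score": range(100)})  # toy stand-in for the full data

n_keep = 10  # stands in for 1,000,000 on the real data
# Guard with min() so the call also works if fewer rows are available.
sample = df.sample(n=min(n_keep, len(df)), random_state=612)
```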

2. Analyze

• Identify the most rated features.
• Identify the users with the most reviews.
• Select the data with products having more than 50 ratings and users who have given more than 50 ratings. Report the shape of the final dataset.
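The selection in the last bullet can be sketched as follows. The column names `author` and `phone_model` follow the text; the helper name `filter_min_ratings` is hypothetical. Counts are computed once on the unfiltered data, so a few users or products may end up with fewer ratings after both filters are applied.

```python
import pandas as pd

def filter_min_ratings(df, user_col, item_col, min_ratings):
    """Keep rows whose user and item both have more than min_ratings ratings."""
    item_counts = df[item_col].value_counts()
    user_counts = df[user_col].value_counts()
    popular_items = item_counts[item_counts > min_ratings].index
    active_users = user_counts[user_counts > min_ratings].index
    keep = df[item_col].isin(popular_items) & df[user_col].isin(active_users)
    return df[keep]

# Toy data with a threshold of 1 instead of 50.
ratings = pd.DataFrame({
    "author": ["u1", "u1", "u2", "u2", "u3"],
    "phone_model": ["p1", "p2", "p1", "p2", "p3"],
    "score": [10, 8, 9, 7, 6],
})
filtered = filter_min_ratings(ratings, "author", "phone_model", min_ratings=1)
print(filtered.shape)
```

On the real data the call would use `min_ratings=50`.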

Most phones have high ratings, followed by neutral ratings, with very few low-rated phones.

A few companies such as Samsung, Nokia, LG and Sony have the most reviews in our dataset, indicative of their share of the phone market in the period 2007-2017. Most ratings for all companies are high, with relatively fewer neutral ratings.

Most reviews in our dataset are from 2014, 2015 and 2016. This might bias the model towards that period, and the predictions might not extend well into the future if the dataset is static.

A few phones such as the Galaxy S6, Galaxy S7 and Moto G have the most reviews and far more high ratings than low or neutral ones. The iPhone 5s and other phones that were not flagships at the time have a good number of high ratings but relatively more low and neutral ratings as well.

Among the top 40 users with the most reviews, some seem to be generic accounts (Amazon Customer, Client Amazon, Anonymous, Anonimo, unknown...) with many reviews. This might affect the personalization of user-based collaborative filtering.

The reviews are in 21 different languages; reviews in other languages need to be translated to English before they can be used.

3. Build a popularity based model and recommend top 5 mobile phones

Hence, we recommend the top 5 phones from the above table using a simple popularity-based model.
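The source does not state how popularity was scored. A minimal sketch, assuming phones are ranked by mean score among those with at least `min_ratings` ratings (both the ranking rule and the helper name `top_n_popular` are assumptions):

```python
import pandas as pd

def top_n_popular(df, n=5, min_ratings=2):
    """Rank phone models by mean score, ignoring rarely rated models."""
    stats = df.groupby("phone_model")["score"].agg(
        mean_score="mean", n_ratings="count"
    )
    stats = stats[stats["n_ratings"] >= min_ratings]
    return stats.sort_values(["mean_score", "n_ratings"], ascending=False).head(n)

# Toy ratings: "b" has a perfect score but only one rating, so it is excluded.
ratings = pd.DataFrame({
    "phone_model": ["a", "a", "b", "c", "c", "c", "d", "d", "d"],
    "score": [10, 8, 10, 6, 8, 10, 9, 9, 9],
})
top = top_n_popular(ratings, n=2)
print(top)
```

The minimum-ratings filter keeps a phone with a single perfect rating from outranking widely reviewed phones.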

4. Build collaborative filtering models: SVD, plus both user-based and item-based nearest-neighbor models.

5. Evaluate the collaborative model | RMSE

SVD
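The source does not show the model code or its results. As an illustration of the idea behind SVD-style recommenders, here is a from-scratch sketch of a biased matrix factorization trained with stochastic gradient descent, evaluated with RMSE on its training ratings. It is not the implementation used in the project; all hyperparameters and the toy ratings are assumptions.

```python
import numpy as np

def fit_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
           lr=0.01, reg=0.02, seed=612):
    """Learn global mean, user/item biases and latent factors by SGD."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0.0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    return mu + bu[u] + bi[i] + P[u] @ Q[i]

def rmse(model, ratings):
    errors = [r - predict(model, u, i) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy (user, item, score) triples on a 1-10 scale.
train = [(0, 0, 10), (0, 1, 8), (0, 2, 2), (1, 0, 9), (1, 1, 9), (1, 3, 3),
         (2, 2, 9), (2, 3, 10), (2, 0, 3), (3, 1, 2), (3, 2, 8), (3, 3, 9)]
model = fit_mf(train, n_users=4, n_items=4)

# RMSE of always predicting the global mean, for comparison.
baseline_rmse = float(np.std([r for _, _, r in train]))
train_rmse = rmse(model, train)
```

In practice the RMSE should be measured on held-out ratings rather than on the training set.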

KNNWithMeans
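The idea behind a mean-centered nearest-neighbor model can be sketched from scratch as below: a rating is predicted as the user's mean plus a similarity-weighted average of the neighbours' deviations from their own means. This is an illustrative sketch, not the library implementation; the similarity measure (cosine on mean-centered co-rated items) and the function name are assumptions.

```python
import numpy as np

def knn_with_means_predict(R, user, item, k=2):
    """R is a users x items array with np.nan for missing ratings."""
    user_means = np.nanmean(R, axis=1)
    centered = R - user_means[:, None]
    neighbours = []
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue
        common = ~np.isnan(R[user]) & ~np.isnan(R[other])
        if not common.any():
            continue
        a, b = centered[user, common], centered[other, common]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sim = float(a @ b / denom) if denom > 0 else 0.0
        if sim > 0:  # keep only positively correlated neighbours
            neighbours.append((sim, centered[other, item]))
    top = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    if not top:
        return user_means[user]  # no neighbours: fall back to the user's mean
    weighted = sum(s * d for s, d in top) / sum(s for s, _ in top)
    return user_means[user] + weighted

# Toy matrix: users 0 and 1 have similar tastes, user 2 the opposite.
R = np.array([
    [10.0, 8.0, np.nan, 2.0],
    [9.0, 7.0, 3.0, 1.0],
    [2.0, 3.0, 9.0, 10.0],
])
pred = knn_with_means_predict(R, user=0, item=2)
```

The item-based variant follows from the same function by passing the transposed matrix and swapping the two indices, e.g. `knn_with_means_predict(R.T, user=2, item=0)`.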

Cross Validate
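K-fold cross-validation of the RMSE can be sketched generically as below. The predictor here is a simple global-mean baseline so that the block stays self-contained; any model exposing a `predict(user, item)` callable could be plugged in. All function names are hypothetical.

```python
import math
import random

def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def cross_validate_rmse(ratings, make_predictor, n_splits=3, seed=612):
    data = list(ratings)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_splits] for i in range(n_splits)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = make_predictor(train)
        scores.append(rmse([(r, predict(u, it)) for u, it, r in test]))
    return scores

def global_mean_predictor(train):
    """Baseline: always predict the mean training score."""
    mean = sum(r for _, _, r in train) / len(train)
    return lambda user, item: mean

# Toy (user, item, score) triples.
ratings = [("u1", "p1", 10), ("u1", "p2", 8), ("u2", "p1", 9),
           ("u2", "p3", 4), ("u3", "p2", 7), ("u3", "p3", 5),
           ("u4", "p1", 10), ("u4", "p2", 6), ("u4", "p3", 3)]
scores = cross_validate_rmse(ratings, global_mean_predictor)
mean_rmse = sum(scores) / len(scores)
```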

6. Predict score (average rating) for test users

7. Report your findings and inferences.

The best models seem to be item-based and user-based collaborative filtering with KNNWithMeans, fit with cross-validation.

8. Try and recommend top 5 products for test users
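Given any fitted model that can predict a score for a (user, item) pair, a top-5 list can be produced by scoring the items the user has not rated yet. A sketch with a hypothetical lookup table standing in for the model:

```python
def recommend_top_n(user, all_items, rated_by_user, predict, n=5):
    """Return the n unrated items with the highest predicted score."""
    candidates = [item for item in all_items if item not in rated_by_user]
    scored = [(predict(user, item), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:n]]

# Hypothetical predicted scores standing in for a fitted model.
predicted = {"p1": 9.5, "p2": 7.0, "p3": 8.2, "p4": 6.1}
top = recommend_top_n(
    user="u1",
    all_items=["p1", "p2", "p3", "p4"],
    rated_by_user={"p1"},
    predict=lambda user, item: predicted[item],
    n=2,
)
print(top)
```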

9. Check for outliers and impute them as required
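The source does not say which outlier rule or imputation was used. One common option, shown here purely as an assumption, is the 1.5 x IQR rule with median imputation. Since scores are bounded by the rating scale, flagged values may be legitimate low ratings, so the result should be inspected before imputing.

```python
import numpy as np

scores = np.array([8, 9, 10, 2, 10, 9, 8, 10], dtype=float)  # toy scores

# Flag values outside 1.5 x IQR of the quartiles.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (scores < lower) | (scores > upper)

# Replace flagged values with the median score.
imputed = np.where(is_outlier, np.median(scores), scores)
```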

11. In what business scenario should you use popularity-based recommendation systems?

• Recommend products rated highly by all users.
• It works without any information on the user.
• It is not personalized for specific users; it uses simple frequency-based recommendations.
• A common approach is to use collaborative filtering whenever we have enough data to avoid cold-start and grey-sheep problems, and to fall back to a simple popularity-based recommendation system when such a problem arises or when we have no data on the current user.
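The fall-back described in the last bullet can be sketched as a small dispatcher. The threshold `min_ratings` and all names are hypothetical; the source gives no specific cut-off.

```python
def recommend(user, user_rating_counts, cf_recommend, popular_items,
              min_ratings=5, n=5):
    """Use collaborative filtering for known users, popularity otherwise."""
    if user_rating_counts.get(user, 0) >= min_ratings:
        return cf_recommend(user)[:n]
    return popular_items[:n]  # new or sparse user: not personalized

# Hypothetical inputs for illustration.
counts = {"u1": 12, "u2": 1}
popular = ["p1", "p2", "p3", "p4", "p5", "p6"]
cf = lambda user: ["p9", "p7", "p1"]

known_user_recs = recommend("u1", counts, cf, popular)
new_user_recs = recommend("u3", counts, cf, popular)
```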

12. In what business scenario should you use CF-based recommendation systems?

• Recommend products rated highly by users similar to the current user (user-based) or items rated similarly to the current item (item-based).
• It doesn't require any information about the users or the content of the reviews; the ratings given by other users for the items are sufficient.
• Might show unrelated products if applied at a high level across all products at once.
• Suffers from the cold-start and grey-sheep problems.
• Cold-start: new products or new users with no ratings or history (empty columns or rows) → use a hybrid approach with a fall-back to a content-based recommendation system.
• Grey-sheep problem: a user has one or two ratings for a few products that no one else in the crowd rated highly, so there are no neighbours to find → switch to popularity-based or content-based recommendations for that user.
• Content-based models can be used to solve the cold-start and grey-sheep problems in collaborative filtering.
• Has to be done at category-level or sub-category-level granularity.
• De-mean the item rating data to remove item bias.
• Generally both user-based and item-based models are used to give recommendations.
• A common approach is to use collaborative filtering whenever we have enough data to avoid cold-start and grey-sheep problems, to use content-based recommendation when we don't have data on the user but have enough information on the product (description, reviews...), and to fall back to a simple popularity-based recommendation system when we have no data on the current user.
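De-meaning the item ratings, as mentioned in the list above, amounts to subtracting each item's mean score from its ratings. A minimal sketch with pandas on toy data (the column name `score_centered` is hypothetical):

```python
import pandas as pd

ratings = pd.DataFrame({
    "phone_model": ["a", "a", "b", "b"],
    "score": [10, 8, 4, 6],
})

# Mean score of each phone, broadcast back to every rating row.
item_mean = ratings.groupby("phone_model")["score"].transform("mean")

# Centered score: how far a rating sits above or below the phone's average.
ratings["score_centered"] = ratings["score"] - item_mean
```

After centering, a rating of 6 for a phone that averages 5 counts as more positive than a rating of 8 for a phone that averages 9.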

13. What other possible methods can you think of which can further improve the recommendation for different users?

• Deep learning models: LSTMs, BiLSTMs (with bidirectional context), Transformers, BERT and similar models can be much more effective at learning sentiment from the reviews using various word embeddings. Libraries such as flair, HuggingFace and keras could be used to build review rating classifiers.
• Preprocessing the reviews (stripping stop words, removing punctuation, extracting keywords, etc.) will help content-based models learn much better.
• Using a hybrid model that combines various techniques will almost always yield better results.
• A common approach is to use latent factor models for high-level recommendations and then improve them with content-based systems that use the information on users or items.
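The review preprocessing mentioned above (lowercasing, removing punctuation, stripping stop words) can be sketched with the standard library alone. The stop-word set here is a tiny illustrative one, not a complete list.

```python
import string

# Tiny illustrative stop-word set; real pipelines use a much longer list.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "it"}

def preprocess(text):
    """Lowercase, strip punctuation and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

tokens = preprocess("The camera is great, and the battery lasts!")
print(tokens)
```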